{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "94f41fcc-50a6-4f60-82d8-920a550b36d7",
   "metadata": {},
   "source": [
    "# Notebook 1: Synthetic Data Generation for RAG Evaluation\n",
    "This notebook demonstrates how to use LLMs to generate question-answer pairs on a knowledge dataset using LLMs.\n",
    "We will use the dataset of pdf files containing the NVIDIA blogs.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1172aafd-f9fb-498c-9fe1-56513150cd09",
   "metadata": {},
   "source": [
    "![synthetic_data](imgs/synthetic_data_pipeline.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6942e6fb-0c42-438e-baa6-238b51cb4caf",
   "metadata": {},
   "source": [
    "## Step 1: Load the PDF Data\n",
    "\n",
    "[LangChain](https://python.langchain.com/docs/get_started/introduction) library provides document loader functionalities that handle several data format (HTML, PDF, code) from different sources and locations (private s3 buckets, public websites, etc).\n",
    "\n",
    "LangChain Document loaders  provide a ``load`` method and output a piece of text (`page_content`) and associated metadata. Learn more about LangChain Document loaders [here](https://python.langchain.com/docs/integrations/document_loaders).\n",
    "\n",
    "In this notebook, we will use a LangChain [`UnstructuredFileLoader`](https://python.langchain.com/docs/integrations/document_loaders/unstructured_file) to load a pdf of NVIDIA blog post."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f21e893a-01ed-4f06-a449-1be8e3500d2d",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%capture\n",
    "!unzip dataset.zip"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b5f1c5b9-f75a-4445-9d46-bdb740dfafda",
   "metadata": {},
   "outputs": [],
   "source": [
    "# take a pdf sample\n",
    "pdf_example='dataset/RGVsbCBUZWNoIDUvMjMvMjMucGRm.pdf'\n",
    "\n",
    "# visualize the pdf sample\n",
    "from IPython.display import IFrame\n",
    "IFrame(pdf_example, width=900, height=500)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4ecfb361-b48d-43a2-ad07-1bffdf2528d5",
   "metadata": {},
   "outputs": [],
   "source": [
    "# import the relevant libraries\n",
    "from langchain.document_loaders import UnstructuredFileLoader"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "14acaabd-f8f7-4ab4-af51-7a745ac021e4",
   "metadata": {},
   "outputs": [],
   "source": [
    "# load the pdf sample\n",
    "loader = UnstructuredFileLoader(pdf_example)\n",
    "data = loader.load()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "374f1ebd-71a1-46fb-a8ec-df0954c7c1c1",
   "metadata": {},
   "source": [
    "## Step 2: Transform the Data \n",
    "\n",
    "The goal of this step is tp break large documents into smaller **chunks**. \n",
    "\n",
    "LangChain library provides a [variety of document transformers](https://python.langchain.com/docs/integrations/document_transformers/), such as `text splitters`. In this example, we will use the generic [``RecursiveCharacterTextSplitter``](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter), we will set the chunk size to 3K and overlap to 100. \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a151567d-ed74-40c5-9f15-939c3317a378",
   "metadata": {},
   "outputs": [],
   "source": [
    "# import the relevant libraries\n",
    "from langchain.text_splitter import  RecursiveCharacterTextSplitter"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "87f7966c-16be-4553-8511-c9fb6188527b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# instantiate the RecursiveCharacterTextSplitter\n",
    "text_splitter = RecursiveCharacterTextSplitter(chunk_size=3000, chunk_overlap=100)\n",
    "\n",
    "# split the loaded pdf sample\n",
    "all_splits = text_splitter.split_documents(data)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "87a64ab8-e75d-4c5b-82ef-604612c80679",
   "metadata": {},
   "source": [
    "Let's check the number of chunks of the document."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7279e523-651b-489f-8544-e2c20543d94b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# check the number of chunks\n",
    "len(all_splits)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d82e03dc-7e77-475e-900d-2903980745d7",
   "metadata": {},
   "source": [
    "Let's check the first chunk of the document."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d54c1553-779c-43f6-b5e7-0fce51408cd8",
   "metadata": {},
   "outputs": [],
   "source": [
    "# check the first chunks\n",
    "all_splits[0].page_content"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6a7baf11-f150-4197-be47-50c2f8f05847",
   "metadata": {},
   "source": [
    "## Step 3: Generate Question-Answer Pairs\n",
    "\n",
    "\n",
    "**Instruction prompt:**\n",
    "```\n",
    "Given the previous paragraph, create one very good question answer pair.\n",
    "Your output should be in a json format of individual question answer pairs.\n",
    "Restrict the question to the context information provided.\n",
    "\n",
    "```\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a21f6ca3-5553-4237-8d0d-5e44a668032a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# set the instruction_prompt\n",
    "instruction_prompt = \"Given the previous paragraph, create two very good question answer pairs. Your output should be in a json format of individual question answer pairs. Restrict the question to the context information provided.\"\n",
    "\n",
    "# set the context prompt\n",
    "context = '\\n'.join([all_splits[0].page_content, instruction_prompt])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2287a97b-8153-49de-86d6-3bf1104025d1",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# check the prompt\n",
    "print(context)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4e876573-28bf-4435-9fe0-53871d877892",
   "metadata": {},
   "source": [
    "#### a) AI Playground LLM generator\n",
    "\n",
    "**NVIDIA AI Playground** on NGC allows developers to experience state of the art LLMs accelerated on NVIDIA DGX Cloud with NVIDIA TensorRT nd Triton Inference Server. Developers get **free credits for 10K requests** to any of the available models. Sign up process is easy. follow the steps <a href=\"https://github.com/NVIDIA/GenerativeAIExamples/blob/main/docs/rag/aiplayground.md\">here</a>. \n",
    "\n",
    "We are going to use theAI playground'ss `llama2-70B `LLM to generate the Question-Answer pairs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e8c8f009-fd13-4ab4-9524-7ee7419f8be2",
   "metadata": {},
   "outputs": [],
   "source": [
    "# import the relevant libraries from langchain\n",
    "from langchain_nvidia_ai_endpoints import ChatNVIDIA"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b4defd23-c8b3-48ff-9be5-d92583b0b014",
   "metadata": {},
   "source": [
    "Let's now use the AI Playground's langchain connector to generate the question-answer pair from the previous context prompt (document chunk + instruction prompt). Populate your API key in the cell below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6639ca2f-1f9f-43c6-9079-a98708f32ace",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "os.environ['NVIDIA_API_KEY'] = \"nvapi-*\"\n",
    "\n",
    "llm = ChatNVIDIA(\n",
    "    model=\"llama2_70b\",\n",
    "    temperature=0.2,\n",
    "    max_tokens=300\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "548d2512-023c-4a48-83ff-257a4787c97c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# check the output\n",
    "answer = llm.invoke(context)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "81fdfd8a-6516-45a7-9589-6bc73d2f2ee7",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(answer)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b37bef8f-96c3-444c-9131-71c236f79cd8",
   "metadata": {},
   "source": [
    "# End-to-End Synthetic Data Generation\n",
    "\n",
    "We have run the above steps and on 600 pdfs of NVIDIA blogs dataset and saved the data in json format below. Where gt_context is the ground truth context and gt_answer is ground truth answer.\n",
    "\n",
    "```\n",
    "{\n",
    "'gt_context': chunk,\n",
    "'document': filename,\n",
    "'question': \"xxxx\",\n",
    "'gt_answer': \"xxxx\"\n",
    "}\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f47cb4b3-43aa-4c95-969c-d48a66310d72",
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "with open(\"qa_generation.json\") as f:\n",
    "    dataset = json.load(f)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "10038acc-347e-4189-a653-71b4b81f52df",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(dataset[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3300a787-4d63-4eea-a3e3-a9888c755f39",
   "metadata": {},
   "source": [
    "# Synthetic Data Post-processing \n",
    "\n",
    "So far, the generated JSON file structure embeds `gt_context`, `document`, the `question` and `gt_answer` pair.\n",
    "\n",
    "In order to evaluate Retrieval Augmented Generation (RAG) systems, we need to add the RAG results fields (To be populated in the next notebook):\n",
    "   - `contexts`: Retrieved documents by the retriever \n",
    "   - `answer`: Generated answer\n",
    "\n",
    "The new dataset JSON format should be: \n",
    "\n",
    "```\n",
    "{\n",
    "'gt_context': chunk,\n",
    "'document': filename,\n",
    "'question': \"xxxxx\",\n",
    "'gt_answer': \"xxx xxx xxxx\",\n",
    "'contexts':\n",
    "'answer':\n",
    "}\n",
    "```"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
